# Load necessary libraries
library(tidyverse)
library(patchwork)
library(cowplot)
library(car)
library(lme4)
library(lmerTest)
library(emmeans)
library(knitr)
library(viridis)

# Set theme for consistent plotting
theme_set(theme_cowplot())

# Custom theme for consistent plotting
my_theme <- theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 10),
    axis.title = element_text(size = 9),
    axis.text = element_text(size = 8)
  )
Introduction
This document provides a comparison between one-way ANOVA and nested ANOVA using the andrew dataset from Quinn & Keough (2002). This tutorial aims to help students understand how the partitioning of variance differs between these two approaches and why accounting for nested factors is crucial in certain experimental designs.
In the andrew dataset, the experimental design consists of:
Four urchin density treatments (TREAT): Control, 66% Density, 33% Density, and Removed
Each treatment was replicated within four random patches (PATCH)
Five replicate quadrats were measured within each treatment-patch combination
The response variable is percentage cover of filamentous algae
Dataset Overview
First, let’s load and explore the dataset:
# Read in the andrew dataset
andrew <- read_csv("data/andrew.csv")

# Convert TREAT to factor with meaningful labels
andrew$TREAT <- factor(andrew$TREAT,
  levels = c("con", "t0.66", "t0.33", "rem"),
  labels = c("Control", "66% Density", "33% Density", "Removed"))

# Convert PATCH to factor
andrew$PATCH <- factor(andrew$PATCH)

# Display the first few rows of the dataset
head(andrew)
# A tibble: 6 × 4
TREAT PATCH QUAD ALGAE
<fct> <fct> <dbl> <dbl>
1 Control 1 1 0
2 Control 1 2 0
3 Control 1 3 0
4 Control 1 4 6
5 Control 1 5 2
6 Control 2 1 0
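The extract references `oneway_summary` and `nested_summary` later without showing the fits. Below is a hedged sketch of how they would be obtained with `aov()`; the toy data merely stand in for the real dataset (the model and variable names are assumptions, chosen to match their later use — with the real data, replace `toy` with `andrew`):

```r
# Toy data mimicking the andrew design: 4 treatments x 4 patches x 5 quadrats
set.seed(1)
toy <- expand.grid(QUAD = 1:5, PATCH = factor(1:16))
toy$TREAT <- factor(rep(rep(c("Control", "66% Density", "33% Density", "Removed"),
                            each = 4), each = 5))
toy$ALGAE <- rnorm(80)

# One-way ANOVA: all non-treatment variation goes into the residual
oneway_model   <- aov(ALGAE ~ TREAT, data = toy)
oneway_summary <- summary(oneway_model)

# Nested ANOVA: patches within treatments get their own term
nested_model   <- aov(ALGAE ~ TREAT + TREAT:PATCH, data = toy)
nested_summary <- summary(nested_model)

nested_summary
```

Note the degrees of freedom this produces: 3 for treatments, 12 for patches within treatments, and 64 residual, matching the corrected table below.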
Important Note: The ANOVA table produced by a naive aov() fit of the nested model does not use the correct error term for testing the treatment effect. In a nested design, the treatment effect should be tested against the patch variation, not the residual error.
Corrected Nested ANOVA with Proper F-tests
Let’s calculate the correct F-ratios and p-values for the nested design:
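The code that computes these corrected ratios is not shown in the extract; a minimal sketch using the mean squares from the table below (the variable names `F_treat` and `p_treat` are assumptions, consistent with their later use):

```r
# Corrected F-ratios for the nested design, from the mean squares
MS_treat <- 4809.712   # Treatment, df = 3
MS_patch <- 1770.162   # Patches within treatment, df = 12
MS_resid <- 298.600    # Residual, df = 64

F_treat <- MS_treat / MS_patch                      # treatment tested against patches
p_treat <- pf(F_treat, 3, 12, lower.tail = FALSE)

F_patch <- MS_patch / MS_resid                      # patches tested against residual
p_patch <- pf(F_patch, 12, 64, lower.tail = FALSE)

round(c(F_treat = F_treat, p_treat = p_treat, F_patch = F_patch), 4)
```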
Source df SS MS F p
1 Treatment 3 14429.14 4809.712 2.717102 9.13e-02
2 Patches (within treatment) 12 21241.95 1770.162 5.928207 <0.001
3 Residual 64 19110.40 298.600 NA <NA>
With the corrected nested ANOVA, we find:
The treatment effect is not significant (F = 2.72, p = 0.0913) when tested against the patch variation.
There is significant variation among patches within treatments (F = 5.93, p < 0.001)
This is a different conclusion than the one-way ANOVA, which found a significant treatment effect.
Variance Decomposition Comparison
Visual Decomposition of Variance Components
First, let’s create a visual representation of how variance is partitioned in a standard one-way ANOVA, and then contrast it with how a nested ANOVA further divides the variance components.
[1] "Treatment means:"
# A tibble: 4 × 2
TREAT treat_mean
<fct> <dbl>
1 Control 1.3
2 66% Density 21.6
3 33% Density 19
4 Removed 39.2
[1] "First few rows of joined patch_means:"
# A tibble: 6 × 4
TREAT PATCH patch_mean treat_mean
<fct> <fct> <dbl> <dbl>
1 Control 1 1.6 1.3
2 Control 2 0 1.3
3 Control 3 1 1.3
4 Control 4 2.6 1.3
5 66% Density 5 28.4 21.6
6 66% Density 6 36.8 21.6
[1] "treat_mean column is present in patch_means"
The plots above visually demonstrate the key differences in how variance is partitioned between one-way and nested ANOVA:
One-way ANOVA (first plot):
Total variance is split into just two components: Among Groups (Treatment) and Within Groups (Error)
The Within Groups component includes all variation not explained by treatments
Nested ANOVA (second plot):
Total variance is split into three components: Among Treatments, Among Patches within Treatments, and Within Patches (Residual Error)
The important addition is the “Among Patches within Treatments” component, which captures the spatial heterogeneity
The actual residual error (Within Patches) is smaller than the Within Groups error in one-way ANOVA
This visualization demonstrates why we get different conclusions: in one-way ANOVA, the patch-to-patch variation is incorrectly included in the error term, leading to an artificially inflated F-ratio for treatments.
Numerical Decomposition of Variance
# Calculate sums of squares for both models
SS_Total <- sum(nested_summary[[1]][, "Sum Sq"])
SS_Treatment <- nested_summary[[1]]["TREAT ", "Sum Sq"]
SS_Patch_within_Treatment <- nested_summary[[1]]["TREAT:PATCH", "Sum Sq"]
SS_Error_Nested <- nested_summary[[1]]["Residuals", "Sum Sq"]
SS_Error_OneWay <- oneway_summary[[1]]["Residuals", "Sum Sq"]

# Calculate percentages for visualization
percent_treatment_oneway <- (SS_Treatment / SS_Total) * 100
percent_error_oneway <- (SS_Error_OneWay / SS_Total) * 100
percent_treatment_nested <- (SS_Treatment / SS_Total) * 100
percent_patch_nested <- (SS_Patch_within_Treatment / SS_Total) * 100
percent_error_nested <- (SS_Error_Nested / SS_Total) * 100

# Create data frame for visualization
ss_comparison <- data.frame(
  Model = c(rep("One-way ANOVA", 2), rep("Nested ANOVA", 3)),
  Source = c("Treatment", "Error (within)",
             "Treatment", "Patch(Treatment)", "Error"),
  Percent = c(percent_treatment_oneway, percent_error_oneway,
              percent_treatment_nested, percent_patch_nested,
              percent_error_nested),
  SS = c(SS_Treatment, SS_Error_OneWay,
         SS_Treatment, SS_Patch_within_Treatment, SS_Error_Nested)
)

# Add factor levels for ordering
ss_comparison$Source <- factor(
  ss_comparison$Source,
  levels = c("Treatment", "Patch(Treatment)", "Error", "Error (within)")
)

# Create colors for the sources
source_colors <- c(
  "Treatment" = "#1b9e77",
  "Patch(Treatment)" = "#d95f02",
  "Error" = "#7570b3",
  "Error (within)" = "#7570b3"
)

# Create the stacked bar plot
p1 <- ggplot(ss_comparison, aes(x = Model, y = Percent, fill = Source)) +
  geom_bar(stat = "identity", position = "stack", color = "black") +
  scale_fill_manual(values = source_colors) +
  labs(
    title = "Comparison of Variance Partitioning",
    subtitle = "One-way vs. Nested ANOVA",
    x = "",
    y = "Percentage of Total Variance",
    fill = "Source of Variation"
  ) +
  geom_text(
    aes(label = sprintf("%.1f%%", Percent)),
    position = position_stack(vjust = 0.5),
    color = "white",
    fontface = "bold"
  ) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right",
    axis.text.x = element_text(face = "bold", size = 12)
  )

# Create a pie chart version for one-way ANOVA
oneway_data <- ss_comparison %>% filter(Model == "One-way ANOVA")
oneway_data$ypos <- cumsum(oneway_data$Percent) - 0.5 * oneway_data$Percent

p2 <- ggplot(oneway_data, aes(x = "", y = Percent, fill = Source)) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar("y", start = 0) +
  labs(title = "One-way ANOVA", subtitle = "Variance Components",
       fill = "Source of Variation") +
  scale_fill_manual(values = source_colors) +
  geom_text(aes(y = ypos, label = sprintf("%.1f%%", Percent)),
            color = "white", fontface = "bold") +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

# Create a pie chart version for nested ANOVA
nested_data <- ss_comparison %>% filter(Model == "Nested ANOVA")
nested_data$ypos <- cumsum(nested_data$Percent) - 0.5 * nested_data$Percent

p3 <- ggplot(nested_data, aes(x = "", y = Percent, fill = Source)) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar("y", start = 0) +
  labs(title = "Nested ANOVA", subtitle = "Variance Components",
       fill = "Source of Variation") +
  scale_fill_manual(values = source_colors) +
  geom_text(aes(y = ypos, label = sprintf("%.1f%%", Percent)),
            color = "white", fontface = "bold") +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom"
  )

# Display the stacked bar plot
p1
# Combine the pie charts
p2 + p3 + plot_layout(ncol = 2)
Mixed-Effects Model Approach
A modern way to analyze nested designs is to use mixed-effects models. Let’s compare the results with our previous analyses:
# Fit linear mixed model with PATCH nested in TREAT
mixed_model <- lmer(ALGAE ~ TREAT + (1 | TREAT:PATCH), data = andrew)

# Show model summary
summary(mixed_model)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: ALGAE ~ TREAT + (1 | TREAT:PATCH)
Data: andrew
REML criterion at convergence: 682.2
Scaled residuals:
Min 1Q Median 3Q Max
-1.9808 -0.3106 -0.1093 0.2831 2.5910
Random effects:
Groups Name Variance Std.Dev.
TREAT:PATCH (Intercept) 294.3 17.16
Residual 298.6 17.28
Number of obs: 80, groups: TREAT:PATCH, 16
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.300 9.408 12.000 0.138 0.8924
TREAT66% Density 20.250 13.305 12.000 1.522 0.1539
TREAT33% Density 17.700 13.305 12.000 1.330 0.2081
TREATRemoved 37.900 13.305 12.000 2.849 0.0147 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) TREAT6D TREAT3D
TREAT66%Dns -0.707
TREAT33%Dns -0.707 0.500
TREATRemovd -0.707 0.500 0.500
# ANOVA-style results
anova(mixed_model, type = 3, ddf = "Satterthwaite")
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
TREAT 2434 811.33 3 12 2.7171 0.09126 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The mixed model approach gives us similar conclusions to the correctly specified nested ANOVA. The treatment effect is non-significant when accounting for the nested structure of patches within treatments.
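The random-effects variances above also quantify how strongly quadrats within a patch resemble each other. A small sketch computing the intraclass correlation from those variance components (values copied from the lmer output):

```r
# Intraclass correlation from the variance components above
var_patch <- 294.3   # TREAT:PATCH variance from the lmer output
var_resid <- 298.6   # residual variance
ICC <- var_patch / (var_patch + var_resid)
round(ICC, 3)
```

An ICC of about 0.5 means roughly half of the non-treatment variance sits at the patch level — exactly the dependence that the one-way ANOVA ignores.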
Visualizing the Nested Structure
One way to understand why we get different results is to visualize the data by patch within treatment:
# Calculate means for each patch
patch_means <- andrew %>%
  group_by(TREAT, PATCH) %>%
  summarize(ALGAE_mean = mean(ALGAE), .groups = "drop")

# Plot patch means
ggplot() +
  geom_point(data = patch_means,
             aes(x = TREAT, y = ALGAE_mean, color = PATCH),
             position = position_jitterdodge(jitter.width = 0.1, dodge.width = 0.7),
             size = 3) +
  # Add treatment means
  geom_point(data = summary_stats, aes(x = TREAT, y = mean),
             size = 5, shape = 18, color = "black") +
  # Add error bars for treatment means
  geom_errorbar(data = summary_stats,
                aes(x = TREAT, ymin = mean - se, ymax = mean + se),
                width = 0.2, color = "black", linewidth = 1) +
  # Plot styling
  scale_color_viridis_d(option = "plasma", end = 0.9) +
  labs(
    title = "Nested Structure: Patch Means within Treatments",
    subtitle = "Large symbols show treatment means (± SE)",
    x = "Urchin Density Treatment",
    y = "Mean Filamentous Algae Cover (%)",
    color = "Patch"
  ) +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold")
  )
This plot clearly shows the high variation among patches within each treatment. This patch-to-patch variation contributes substantially to the total variance, which is not accounted for in the one-way ANOVA.
F-ratio Demonstration
Let’s illustrate how the F-ratio for treatments differs between one-way and nested ANOVA:
# Create a data frame for comparison
f_ratio_comparison <- data.frame(
  Model = c("One-way ANOVA", "Nested ANOVA"),
  F_numerator = c("MS Treatment", "MS Treatment"),
  F_denominator = c("MS Residual", "MS Patch(Treatment)"),
  F_value = c(oneway_summary[[1]][1, "F value"], F_treat),
  p_value = c(oneway_summary[[1]][1, "Pr(>F)"], p_treat)
)

# Display the comparison
f_ratio_comparison
Model F_numerator F_denominator F_value p_value
1 One-way ANOVA MS Treatment MS Residual 9.058658 3.362391e-05
2 Nested ANOVA MS Treatment MS Patch(Treatment) 2.717102 9.126200e-02
Key Differences and Implications
The comparison between one-way ANOVA and nested ANOVA reveals several important differences:
Variance Partitioning:
In one-way ANOVA, all variation not explained by treatments is lumped into the “Error” term.
In nested ANOVA, this variation is partitioned into “Patch(Treatment)” and “Error” components.
F-ratio for Testing Treatment Effects:
One-way ANOVA tests treatments against residual error (MS Treatment / MS Residual).
Nested ANOVA tests treatments against patch variation (MS Treatment / MS Patch(Treatment)).
Nested ANOVA reveals that most variation is among patches, with non-significant treatment effects (p = 0.0913).
The substantial patch variability (38.8% of total variation) masks the treatment effect when properly accounted for.
Statistical Power and Type I Error:
Ignoring the nested structure leads to pseudoreplication and inflated Type I error rates.
The one-way ANOVA effectively treats each quadrat as an independent sample, inflating the degrees of freedom for the error term.
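One way to see the pseudoreplication point concretely: in a balanced design, a one-way ANOVA on the patch means reproduces the correct nested F-test exactly. A sketch with toy data of the same 4 × 4 × 5 structure (with the real data, substitute `andrew` for `d`):

```r
# Toy data with strong patch effects
set.seed(42)
d <- expand.grid(QUAD = 1:5, PATCH = factor(1:16))
d$TREAT <- factor(rep(rep(paste0("T", 1:4), each = 4), each = 5))
d$ALGAE <- rnorm(80) + rep(rnorm(16, sd = 2), each = 5)   # patch effects

# Correct nested F for treatments: MS Treatment / MS Patch(Treatment)
ns <- summary(aov(ALGAE ~ TREAT + TREAT:PATCH, data = d))[[1]]
F_nested <- ns[1, "Mean Sq"] / ns[2, "Mean Sq"]

# One-way ANOVA on the 16 patch means
pm <- aggregate(ALGAE ~ TREAT + PATCH, data = d, FUN = mean)
F_means <- summary(aov(ALGAE ~ TREAT, data = pm))[[1]][1, "F value"]

all.equal(F_nested, F_means)   # identical in a balanced design
```

Averaging to the level of true replicates (patches) is therefore a legitimate, simple alternative to the nested model when the design is balanced.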
Conclusion
This demonstration illustrates why accounting for hierarchical or nested structures in experimental designs is crucial for valid statistical inference. Failure to account for such structures can lead to:
Pseudoreplication (treating non-independent samples as independent)
Inflation of Type I error rates (finding spurious “significant” effects)
Incorrect partitioning of variance sources
Misleading biological interpretations
In the andrew dataset, the nested ANOVA reveals that spatial heterogeneity (patch-to-patch variation) is the dominant factor influencing algae cover, not the experimental treatment. This insight would be missed if only a one-way ANOVA were used.
Alternative Code for Mixed Models
For advanced users, we could also fit this as a mixed model with lmerTest:
library(lmerTest)

# Mixed model with patches nested within treatments
mixed_model_alt <- lmer(ALGAE ~ TREAT + (1 | TREAT:PATCH), data = andrew)
anova(mixed_model_alt, type = 3, ddf = "Satterthwaite")
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
TREAT 2434 811.33 3 12 2.7171 0.09126 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# We could also try a simpler random effects structure
simple_mixed_model <- lmer(ALGAE ~ TREAT + (1 | PATCH), data = andrew)
anova(simple_mixed_model, type = 3, ddf = "Satterthwaite")
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
TREAT 2434 811.33 3 12 2.7171 0.09126 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because each patch has a unique label in this dataset, (1|PATCH) encodes the same grouping as (1|TREAT:PATCH), which is why the two tables are identical. Mixed models provide a more flexible approach to handling nested designs and are the recommended approach in modern statistical practice.
This lecture covers two fundamental statistical techniques in biology: correlation and regression analysis. Based on Chapters 16-17 from Whitlock & Schluter’s The Analysis of Biological Data (3rd edition), we’ll explore:
Correlation analysis: measuring relationships between variables
The distinction between correlation and regression
Simple linear regression: predicting one variable from another
Estimating and interpreting regression parameters
Testing assumptions and handling violations
Analysis of variance in regression
Model selection and comparison
Lecture 9: Correlation vs. Regression:
What’s the Difference?
Correlation Analysis:
Measures the strength and direction of a relationship between two numerical variables
Both X and Y are random variables (both measured, neither manipulated)
Variables are typically on equal footing (either could be X or Y)
No cause-effect relationship implied
Quantifies the degree to which variables are related
Expressed as a correlation coefficient (r) from -1 to +1
Regression Analysis:
Predicts one variable (Y) from another (X)
X is often fixed or controlled (manipulated)
Y is the response variable of interest
Often implies a cause-effect relationship
Produces an equation for prediction
Estimates slope and intercept parameters
Lecture 9: Correlation Analysis
What Is Correlation?
Correlation analysis measures the strength and direction of a relationship between two numerical variables:
Ranges from -1 to +1
+1 indicates perfect positive correlation
0 indicates no correlation
-1 indicates perfect negative correlation
The Pearson correlation coefficient (r) is defined as:
\[r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\, s_X s_Y}\]
Where \(s_X\) and \(s_Y\) are the standard deviations of X and Y.
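The formula above can be checked directly against R's built-in `cor()`; a small self-contained sketch with illustrative data:

```r
# Pearson's r computed directly from the definition above
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
r <- sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
all.equal(r, cor(x, y))   # matches R's built-in cor()
```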
Lecture 9: Correlation Analysis
Example 16.1: Flipping the Bird
Nazca boobies (Sula granti) - Do aggressive behaviors as a chick predict future aggressive behavior as an adult?
The correlation is r = 0.534, a moderate positive relationship
The p-value = 0.007, so the correlation is statistically significant
For a Pearson correlation coefficient (r) of 0.53372:
This is Pearson's r (not Spearman's rho, the nonparametric alternative shown below), as indicated by "cor" in the output
To determine the amount of variation explained, square this value: r² = 0.53372² = 0.2849 (approximately 28.49%)
This means about 28.49% of the variance in one variable can be explained by the other variable
Note \(\text{t}=\frac{r}{SE_r}\)
[1] 0.5337225
Pearson's product-moment correlation
data: booby_data$visits_as_nestling and booby_data$future_aggression
t = 2.9603, df = 22, p-value = 0.007229
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1660840 0.7710999
sample estimates:
cor
0.5337225
Lecture 9: Correlation Analysis
Example 16.1: Flipping the Bird
Interpretation: The correlation coefficient of r = 0.534 suggests that Nazca boobies who experienced more visits from non-parent adults as nestlings tend to display more aggressive behavior as adults. This supports the hypothesis that early experiences influence adult behavior patterns in this species.
Standard Error:
\(\text{SE}_r = \sqrt{\frac{1-r^2}{n-2}}\)
SE = 0.180
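Plugging the booby values into the formulas above reproduces the test statistic from the output (r = 0.534, n = 24, df = 22):

```r
# SE of r and the t statistic for the booby example above
r <- 0.5337225
n <- 24
SE_r   <- sqrt((1 - r^2) / (n - 2))
t_stat <- r / SE_r
round(c(SE = SE_r, t = t_stat), 3)   # SE ≈ 0.180, t ≈ 2.960
```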
We need to be sure the relationship is not curved (see the assumptions below)
Lecture 9: Correlation Analysis
Testing Assumptions for Correlation
As described in Section 16.3, correlation analysis has key assumptions:
Random sampling: Observations should be a random sample from the population
Bivariate normality: Both variables follow a normal distribution, and their joint distribution is bivariate normal
Linear relationship: The relationship between variables is linear, not curved
Let’s check these assumptions using the lion data from Example 17.1 Lion Noses:
Shapiro-Wilk normality test
data: lion_data$proportion_black
W = 0.88895, p-value = 0.003279
Shapiro-Wilk normality test
data: lion_data$age_years
W = 0.87615, p-value = 0.001615
Lecture 9: Correlation Analysis
What to do if assumptions are violated:
Transform one or both variables (log, square root, etc.)
Use a non-parametric correlation: Spearman's rank correlation or Kendall's tau (τ)
Examine the data for outliers or influential points
To understand the amount of variation explained, you can square Spearman's rho.
For the observed value of ρ = 0.74485:
ρ² = 0.74485² = 0.5548
This means approximately 55.48% of the variance in the ranks of one variable can be explained by the ranks of the other variable. This is analogous to how R² works in linear regression, but for ranked data.
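Spearman's correlation works on ranks, so it captures monotone but non-linear relationships. A self-contained sketch (with the lion data, the call would be `cor.test(lion_data$proportion_black, lion_data$age_years, method = "spearman")`, matching the output below):

```r
# Monotone but strongly curved toy data
x <- 1:10
y <- x^3 + c(0.1, -0.2, 0.3, 0, -0.1, 0.2, -0.3, 0.1, 0, -0.2)
cor.test(x, y, method = "spearman")   # rho = 1 despite the curvature
```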
Spearman's rank correlation rho
data: lion_data$proportion_black and lion_data$age_years
S = 1392.1, p-value = 1.013e-06
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7448561
Lecture 9: Correlation Analysis
Correlation: Important Considerations
The correlation coefficient depends on the range
Restricting range of values can reduce the correlation coefficient
Comparing correlations between studies requires similar ranges of values
Measurement error affects correlation
Measurement error in X or Y tends to weaken observed correlation
This bias is called attenuation
True correlation typically stronger than observed correlation
Correlation vs. Causation
Correlation does not imply causation
Three possible explanations for correlation:
X causes Y
Y causes X
Z (a third variable) causes both X and Y
Correlation significance test
H₀: ρ = 0 (no correlation in population)
H₁: ρ ≠ 0 (correlation exists in population)
Test statistic: t = r / SE(r) with df = n-2
Lecture 9: Linear Regression
Simple Linear Regression Model
Simple linear regression models the relationship between a response variable (Y) and a predictor variable (X).
The population regression model is: \[Y = \alpha + \beta X + \varepsilon\]
Where:
Y is the response variable
X is the predictor variable
α (alpha) is the intercept (value of Y when X=0)
β (beta) is the slope (change in Y per unit change in X)
ε (epsilon) is the error term (random deviation from the line)
The sample regression equation is:
\[\hat{Y} = a + bX\]
Where:
\(\hat{Y}\) is the predicted value of Y
a is the estimate of α (intercept)
b is the estimate of β (slope)
Method of Least Squares: The line is chosen to minimize the sum of squared vertical distances (residuals) between observed and predicted Y values.
Lecture 9: Linear Regression
From Example 17.1 in the textbook, the regression line for the lion data is:
\[\widehat{\text{age}} = 0.88 + 10.65 \times \text{proportion black}\]
This means:
When a lion has no black on its nose (proportion = 0), its predicted age is 0.88 years
For each 0.1 increase in the proportion of black, predicted age increases by about 1.06 years
The slope (10.65) indicates that lions with more black on their noses tend to be older
Lecture 9: Linear Regression
Simple Linear Regression Model
Male lions develop more black pigmentation on their noses as they age
This relationship can be used to estimate the age of lions in the field
Call:
lm(formula = age_years ~ proportion_black, data = lion_data)
Residuals:
Min 1Q Median 3Q Max
-2.5449 -1.1117 -0.5285 0.9635 4.3421
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8790 0.5688 1.545 0.133
proportion_black 10.6471 1.5095 7.053 7.68e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.669 on 30 degrees of freedom
Multiple R-squared: 0.6238, Adjusted R-squared: 0.6113
F-statistic: 49.75 on 1 and 30 DF, p-value: 7.677e-08
Lecture 9: Linear Regression
Simple Linear Regression Model
The calculation for slope (b) is: \[b = \frac{\sum_i(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i(X_i - \bar{X})^2}\]
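The slope formula above can be applied directly; a small self-contained sketch with illustrative data, checked against `lm()`:

```r
# Least-squares slope and intercept computed from the formula above
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.8)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)
c(intercept = a, slope = b)   # matches coef(lm(y ~ x))
```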
Call:
lm(formula = log_stability ~ species_number, data = prairie_data)
Residuals:
Min 1Q Median 3Q Max
-0.82774 -0.25344 -0.00426 0.27498 0.75240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.252629 0.041023 30.535 < 2e-16 ***
species_number 0.025984 0.004926 5.275 4.28e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3433 on 159 degrees of freedom
Multiple R-squared: 0.149, Adjusted R-squared: 0.1436
F-statistic: 27.83 on 1 and 159 DF, p-value: 4.276e-07
[1] "rsquared is: 0.148953385305455"
Analysis of Variance Table
Response: log_stability
Df Sum Sq Mean Sq F value Pr(>F)
species_number 1 3.2792 3.2792 27.829 4.276e-07 ***
Residuals 159 18.7358 0.1178
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lecture 9: Linear Regression
The hypothesis test asks whether the slope equals zero:
H₀: β = 0 (species number does not affect stability)
H₁: β ≠ 0 (species number does affect stability)
The test statistic is: \(t = \frac{b - \beta_0}{SE_b}\)
With df = n - 2 = 161 - 2 = 159
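The t statistic can be recovered from the coefficient table above (slope 0.025984, SE 0.004926, df = 159):

```r
# Slope t-test computed from the prairie model output above
b    <- 0.025984
SE_b <- 0.004926
t_stat <- b / SE_b
p      <- 2 * pt(abs(t_stat), df = 159, lower.tail = FALSE)
round(t_stat, 3)   # ≈ 5.275, matching the summary table
```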
Interpretation:
The slope estimate is 0.026, indicating that log stability increases by about 0.026 units for each additional plant species in the plot.
The p-value is very small (4.28e-07), providing strong evidence to reject the null hypothesis that species number has no effect on ecosystem stability.
R² = 0.149, meaning that approximately 14.9% of the variation in log stability is explained by the number of plant species.
This supports the biodiversity-stability hypothesis: more diverse plant communities have more stable biomass production over time.
Lecture 9: Linear Regression
Testing Regression Assumptions
Linear regression has four key assumptions:
Linearity: The relationship between X and Y is linear
Independence: Observations are independent
Homoscedasticity: Equal variance across all values of X
Normality: Residuals are normally distributed
Let’s check these assumptions for the lion regression model:
We assume that the error ε (estimated by the residuals \(e_i = y_i - \hat{y}_i\)):
is normally distributed at each \(x_i\)
has the same variance at each \(x_i\)
has a mean of 0 at each \(x_i\)
Lecture 9: Linear Regression
Testing Regression Assumptions
The error ε is estimated by the residuals: \(e_i = y_i - \hat{y}_i\)
Ordinary least squares chooses the intercept a and slope b to minimize the sum of squared residuals (equivalently, the mean squared error):
\(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)
Lecture 9: Linear Regression
Testing Regression Assumptions
Let's check the normality of the residuals for the lion regression model:
Shapiro-Wilk normality test
data: residuals(lion_model)
W = 0.93879, p-value = 0.0692
Lecture 9: Linear Regression
Simple Linear Regression Model
If assumptions are violated:
1. Transform the data (Section 17.6)
2. Use weighted least squares for heteroscedasticity
3. Consider non-linear models (Section 17.8)
Lecture 9: Linear Regression - estimates of error and significance
Estimates of the standard errors and confidence intervals for the slope and intercept can be used to construct confidence bands
The 95% confidence band will contain the true population regression line in 95 of 100 cases under repeated sampling
This is usually computed in R
Lecture 9: Linear Regression - estimates of error and significance
In addition to estimating the population parameters (β₀, β₁), we want to test hypotheses about them
This is accomplished by analysis of variance
We partition the variance in Y into variation due to X and variation due to other things (error)
Lecture 9: Linear Regression - estimates of variance
Total variation in Y is “partitioned” into 3 components:
\(SS_{regression}\): variation explained by regression
difference between predicted values (\(\hat{y}_i\)) and the mean (\(\bar{y}\))
df = 1 for simple linear regression (number of parameters - 1)
\(SS_{residual}\): variation not explained by regression
difference between observed (\(y_i\)) and predicted (\(\hat{y}_i\)) values
df = n - 2
\(SS_{total}\): total variation
sum of squared deviations of each observation (\(y_i\)) from the mean (\(\bar{y}\))
df = n - 1
Lecture 9: Linear Regression - estimates of variance
Total variation in Y is “partitioned” into 3 components:
\(SS_{regression}\): variation explained by regression
greater in panel C than in panel D
\(SS_{residual}\): variation not explained by regression
greater in panel B than in panel A
\(SS_{total}\): total variation
Lecture 9: Linear Regression - estimates of variance
Sums of squares and degrees of freedom are additive:
\(SS_{regression} + SS_{residual} = SS_{total}\)
\(df_{regression} + df_{residual} = df_{total}\)
Sums of squares depend on n
We therefore need a different estimate of variance
Lecture 9: Linear Regression - estimates of variance
Sums of squares are converted to mean squares
A mean square is a sum of squares divided by its degrees of freedom, so it does not depend on n
\(MS_{residual}\): estimates the population variance
\(MS_{regression}\): estimates the population variance plus variation due to the X-Y relationship
Mean squares are not additive
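The mean squares and F-ratio can be recomputed from the prairie ANOVA table shown earlier (SS regression = 3.2792 on 1 df, SS residual = 18.7358 on 159 df):

```r
# Mean squares and F from the prairie ANOVA table above
SS_reg <- 3.2792;  df_reg <- 1
SS_res <- 18.7358; df_res <- 159
MS_reg <- SS_reg / df_reg
MS_res <- SS_res / df_res
F_val  <- MS_reg / MS_res
round(c(MS_reg = MS_reg, MS_res = MS_res, F = F_val), 4)   # F ≈ 27.83
```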